Abstract. This paper describes a refinement to our procedure for porting
lexical conceptual structure (LCS) into new languages. Specifically,
we describe a two-step process for creating candidate thematic grids for
Mandarin Chinese verbs, using the English verb heading the VP in the
subdefinitions to separate senses, and roughly parsing the verb complement
structure to match thematic structure templates. We accomplished
a substantial reduction in manual effort, without substantive loss. The
procedure is part of a larger process of creating a usable lexicon for
interlingual machine translation from a large on-line resource with both
too much and too little information.


1  Introduction 

In previous work on Spanish and Arabic (Dorr et al., 1997; Dorr, 1997a), we
reported the results of an acquisition process for verb databases in new
languages, using automatic assignment of candidate thematic structure templates
(“grids”) and manual verification of the output. This paper reports on
acquisition of a Mandarin Chinese verb database from an on-line resource ten times
as large as those used for Spanish and Arabic (600k rather than 60k entries).
The procedure is part of a larger process of creating a usable lexicon from a
large on-line resource that contains both too much and too little of the
information needed by our interlingual machine translation system
(Dorr, 1997b; Hogan and Levin, 1994).

The major contributions of this work are: (i) reducing the effect of polysemy
by addressing it in the preprocessing phase, and (ii) removing a substantial
subset of the automatically generated thematic grids that would otherwise
require manual correction, by relating thematic information incorporated in
Chinese verbs to overt complements in English. Together, these steps reduce the
material that needs to be corrected by hand by 11%, or 15,565 candidate
thematic grids. Furthermore, the separation of senses allows candidate grids to
be evaluated with respect to a particular sense. Noise that is introduced from
polysemy in the English glosses (for example, by putting the grids from the
‘run a machine’ and ‘run a race’ senses into a “bag of grids”) may therefore be
eliminated, as grids are tied to a particular sense. The reduction of manual
effort has been accomplished without substantive loss of relevant definitions,
as evaluated in a preliminary task: assigning thematic grids to verbs in 10
sentences from a corpus of 10 Xinhua articles.

We see this work as a step beyond that suggested by (Dorr et al., 1997), in
which manual correction took place without first reducing the degree of
ambiguity in the entry. Dorr et al. generated 18353 candidate thematic grids,
representing 3623 verbs in the initial Spanish-English lexicon. Of that, 3025
entries were verified as correct (16.5%), and 15328 (83.5%) had to be modified
in some way. There were 6082 deletions of entries, 334 reclassifications
(resulting in changes of entries) and 6295 refinements of entries. The
refinements included 3648 deletions of non-applicable entries; 2747 changes to
prepositions, optional roles made obligatory, etc.; 2617 entries (955 verbs)
deleted due to rarity of usage and/or disjointness with respect to WordNet
concepts; 1213 new entries added (representing 1092 verbs not in the initial
database). That is, there were a total of 9730 deletions, representing 63.5%
of the required modifications, and 53% of the total number of candidate grids.
Thus, an automatic process that reduces the number of deletions in a principled
way would substantially reduce the manual correction process.
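
The deletion totals quoted above can be re-derived from the individual counts. The following is a minimal Python check that uses only figures from this paragraph; the variable names are ours and are not part of the original work.

```python
# Counts reported for the Spanish-English lexicon (Dorr et al., 1997).
candidate_grids = 18353       # all automatically generated candidate grids
modified = 15328              # candidates that had to be modified in some way
entry_deletions = 6082        # whole entries deleted
refinement_deletions = 3648   # non-applicable entries deleted during refinement

total_deletions = entry_deletions + refinement_deletions
share_of_modifications = 100 * total_deletions / modified
share_of_candidates = 100 * total_deletions / candidate_grids

print(total_deletions)                    # 9730
print(round(share_of_modifications, 1))   # 63.5
print(round(share_of_candidates, 1))      # 53.0
```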

In  the  next  section  we  describe  the  role  of  thematic  grids  in  our  system. 
We  then  describe  our  lexicon  acquisition  procedure,  with  respect  to  the  verbs, 
detailing  how  we  attempted  to  deal  with  polysemy  and  overgeneration  of  grids. 
We  also  report  on  other  issues  that  arose  in  adapting  the  on-line  resources. 

2  Thematic  Structure:  Grids 

Thematic structure serves as the interface between the syntactic component
(parsing) and the lexical-semantic component, the Lexical Conceptual Structure
(LCS). Verbs are assigned to classes that share syntactic and semantic
behavior. That is, verbs in a class appear in the same types of sentences, with
the same syntactic and semantic type of complements, represented as thematic
(or “theta”) roles. The syntactic and semantic behavior is abbreviated in the
form of thematic grids, consisting of lists of obligatory and optional thematic
roles, including agent, theme (patient), experiencer, source, goal and location.

In thematic grids, roles preceded by an underscore (_) are obligatory, and
those preceded by a comma (,) are optional. A set of parentheses () indicates
that the role must be expressed as a complement of a preposition or
complementizer (e.g. the infinitival to in English). If the preposition is
indicated, that preposition must be the head of the phrase. For example, the
thematic grid _ag_th,src(from),goal(to) indicates that agent and theme are
obligatory, and that source and goal are optional and must be expressed by from
and to prepositional phrases, respectively. Assigning this grid to the Send
verbs, for example (class 11.1 in (Levin, 1993)), allows these verbs to appear
in sentences like (1)-(4), but not (5)-(8), in which the obligatory theme
argument is missing.


(1)  I  sent  the  book. 

(2)  I  sent  the  book  to  Mary.

(3)  I  sent  the  book  from  the  warehouse. 

(4)  I  sent  the  book  from  the  warehouse  to  Mary. 

(5)  *  I  sent. 

(6)  *  I  sent  to  Mary. 

(7)  *  I  sent  from  the  warehouse. 

(8)  *  I  sent  from  the  warehouse  to  Mary. 
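
The grid notation is regular enough to be read mechanically. The sketch below is an illustration only, not the system's actual implementation: the names `Role`, `parse_grid` and `licenses` are hypothetical, and the check looks only at which roles are expressed, ignoring word order and the choice of preposition.

```python
import re
from typing import NamedTuple, Optional

class Role(NamedTuple):
    name: str              # role label, e.g. "ag", "th", "src"
    obligatory: bool       # True after "_", False after ","
    prep: Optional[str]    # None: bare argument; "": any preposition; else required head

# One role: a marker, a label, and an optional parenthesized preposition.
ROLE_RE = re.compile(r"([_,])([a-z-]+)(?:\(([a-z]*)\))?")

def parse_grid(grid):
    """Turn a grid string such as '_ag_th,src(from),goal(to)' into Role tuples."""
    return [Role(m.group(2), m.group(1) == "_", m.group(3))
            for m in ROLE_RE.finditer(grid)]

def licenses(grid, expressed):
    """True if every obligatory role is expressed and no unknown role appears."""
    roles = parse_grid(grid)
    obligatory = {r.name for r in roles if r.obligatory}
    allowed = {r.name for r in roles}
    return obligatory <= expressed <= allowed

SEND = "_ag_th,src(from),goal(to)"
print(licenses(SEND, {"ag", "th", "goal"}))  # sentence (2): True
print(licenses(SEND, {"ag", "goal"}))        # sentence (6): False, theme missing
```

Under these assumptions, sentences (1)-(4) are accepted and (5)-(8) are rejected, because each of the latter lacks the obligatory theme.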

The  thematic  roles  map  directly  into  numbers,  representing  variables  in  the 
LCS.  Although  theta  roles  are  theoretically  unordered  (Rappaport  and  Levin, 
1988),  the  numbers  correspond  to  a  “canonical”  linear  position  in  a  sentence 
and  relative  structural  height  in  syntax  and  LCS  trees.  Thus  1  in  the  LCS 
corresponds  to  the  ag(ent)  thematic  role  and  2  to  th(eme),  since  agents  usually 
precede  themes  and  occur  higher  in  the  syntactic  tree.  That  is,  in  a  sentence  with 
an  agent  and  a  theme,  typically  the  agent  will  be  the  subject  and  the  theme  the 
object,  and  both  will  precede  other  arguments. 

The LCS for the above grid (simplifying irrelevant details) is given below:
agent = thing 1, theme = thing 2, source preposition = thing 3, source
complement of the preposition = thing 4, goal preposition = thing 5, goal
complement of the preposition = thing 6. The * markers indicate where arguments
are instantiated.

(cause (* thing 1)
       (go loc (* thing 2)
           ((* to 5) loc (thing 2) (at loc (thing 2) (thing 6)))
           ((* from 3) loc (thing 2) (at loc (thing 2) (thing 4))))
       (!!-ingly 26))

Thematic grids represent multiple structures; additionally, verbs in a
language can take more than one thematic grid. For example, verbs like fill,
carpet, cloak and plug allow the following grids, among others:^

(9)  _ag_th,mod-poss(with)  Derek  filled  the  bucket  with  water.

(10)  _mod-poss_th  The  water  filled  the  bucket. 

Other verb classes may take some of these grids but not others. Verbs like
inscribe, mark, sign, stamp take the former, but not the latter, for example:
She signed his yearbook with her name, but not *His name signed her yearbook.
The grids therefore group verbs by “semantic structure” (Levin and Rappaport
Hovav, 1995). In contrast to “semantic content” (the idiosyncratic aspect
of verb meaning), semantic structure determines syntactic patterning within and
across languages (Dorr and Oard, 1998; Dorr and Katsova, 1998; Grimshaw,
1993; Pinker, 1984; Pinker, 1989).

^  The  mod-poss  indicates  a  “possessed”  item,  paraphraseable  by  have:  The  hole  has 
water  (in  it). 


Most  importantly  for  our  system,  the  assignment  of  a  set  of  thematic  grids 
to  a  verb  class  allows  us  to  create  the  interlingual  LCS  structures  automatically 
(Dorr  et  al.,  1995).  Furthermore,  in  selecting  grids  for  creating  LCS  entries  for
a  new  language,  we  leverage  the  fact  that  semantic  structure  overlaps  across 
languages  to  a  large  degree.  The  task  of  creating  the  grids  is  therefore  reduced  to 
automatic  generation,  as  described  in  Section  4,  followed  by  manual  correction 
to  eliminate  inappropriate  grids  (along  with  other  modifications,  described  in 
Section  1  above).  Before  we  describe  the  automatic  process,  we  first  describe 
some  of  the  pre-processing  required  to  extract  appropriate  verbs. 

3  Verb  Selection 

The assignment of thematic grids/LCS structures to verbs is one step in the
creation of a lexicon from a large (600k entries) machine-readable
Chinese-English dictionary. The dictionary was compiled by hand by the
Chinese-English Translation Assistance (CETA) group from some 250 dictionaries,
some general purpose, others domain-specific or bilingual (Russian-Chinese,
English-Chinese, etc.). The CETA group, started in 1965 and continuing into the
present decade, was a joint government-academic project. The machine-readable
version of the CETA dictionary, Optilex, is licensed from the MRM Corporation,
Kensington, MD.

CETA  contains  some  information  extraneous  to  our  purposes.  Some  of  the 
250  resources  used  to  create  the  dictionary  were  very  domain-specific,  including, 
for  example,  Collier’s  North  China  Colloquial  Collection,  a  publication  listing
many  regionalisms  not  observed  anywhere  else  in  China,  and  the  Faxue  Cidian, 
a  dictionary  of  legal  terms  from  Shanghai.  We  eliminated  many  archaic  and 
technical  verbs  by  eliminating  verbs  identified  by  CETA  as  derived  from  these 
sources.^ 

Even  after  archaic  or  idiosyncratic  sources  were  eliminated,  entries  varied 
widely  in  specificity,  from  the  general  verbs  (and  other  words)  to  the  extremely 
specific,  as  the  examples  below  show,  given  with  the  Pinyin,  definition,  and 
simplified  character  representation  from  CETA. 


^ BF Chinese-English Dictionary 1978, BE same as E but Chinese-Chinese 1978, AR
Atlas of the PRC 1977 (for Chinese placenames), AO Gazetteer of the PRC (also for
Chinese placenames), BQ extra new entries from the first two above, BE and BF, CJ
standardized FBIS translations of Chinese communist terms, CM specialized terms
extracted from Mao’s works, CU Hong Kong glossary of Chinese communist terms,
EJ 1981 idiom dictionary, EK 1982 idiom dictionary, FA Foreign Exchange terms
1963, IP International political economics glossary 1980, IQ Beijing social sciences
academy economics terms 1983, NA world place names 1981, PP primary political
economics glossary 1956, TM McGraw-Hill general scientific and technical dictionary
1963, VF Lin Yutang’s dictionary 1972, VT 1973 Beijing foreign exchange glossary,
WB Liang Shih-ch’iu’s traditional dictionary 1973, YG Stanford’s dictionary of
Chinese communist terms 1973.


(11)  po4_shi3  compel

(12)  po4_shi3  force

(13)  ben1_pao3  run

(14)  zou3  walk

(15)  chu1_kou3  speak^

(16)  bi1_gong1  force_the_sovereign_to_abdicate

(17)  ben1_zou3_xiang1_gao4  run_around_spreading_the_news

(18)  ca1_la5_zhe5_zou3  walk_dragging_one’s_feet

(19)  chui1_xu1  speak_in_favor_of_somebody_in_exaggerated_terms


Although CETA is large, and in some ways exhaustive, some information
required by our machine-translation lexicon is not directly encoded, notably
part of speech.^ We identified the verbs by a simple process. We parsed the DEE
(gloss) field in the CETA entries from the selected sources. If an English gloss
began with the infinitival ‘to’, the whole entry was used to generate as many
new verb entries as there are verbs in the DEE field. As an example, the excerpt
from the following entry has four subentries in its DEE field. PY gives the
Pinyin representation.^

PY:  bian1  ta4

DEE:  1.  to  whip,  to  flog  2.  <fig>  to  chastise,  to  castigate

After  processing,  each  definition  has  a  single  subsense  entry,  i.e.  there  are  four 
subentries. 
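
A minimal sketch of this splitting step follows. It is not the original code: the function name `split_gloss_field` is hypothetical, and treating every comma as a subentry boundary is a simplifying assumption that would mis-split glosses containing internal commas.

```python
import re

def split_gloss_field(field):
    """Return one verb gloss per subentry of a dictionary gloss field."""
    # Sense numbers ("1.", "2.") and commas both separate subentries.
    pieces = re.split(r"\s*(?:\d+\.|,)\s*", field)
    glosses = []
    for piece in pieces:
        # Drop usage tags such as "<fig>".
        gloss = re.sub(r"<[^>]*>", "", piece).strip()
        # Keep only glosses that begin with the infinitival "to".
        if gloss.startswith("to "):
            glosses.append(gloss)
    return glosses

entry = {"PY": "bian1 ta4",
         "DEE": "1. to whip, to flog 2. <fig> to chastise, to castigate"}

# One (pinyin, gloss) pair per verb sense.
subentries = [(entry["PY"], gloss) for gloss in split_gloss_field(entry["DEE"])]
for pinyin, gloss in subentries:
    print(pinyin, "|", gloss)
```

For the entry above this yields the four subentries described in the text, each pairing the same Pinyin with a single English verb gloss.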


4  Pairing  Verbs  and  Thematic  Grids 

4.1  English  glosses 

For  the  Arabic  and  Spanish  lexicons,  we  created  candidate  thematic  grids  by
pairing  target  language  words  with  the  thematic  grids  associated  with  their 
English  gloss,  with  manual  correction  over  a  period  of  two  weeks.  We  did  the 
same  initial  step  for  Chinese,  as  well.  However,  as  described  above,  the  senses 
had  already  been  separated  into  different  subentries.  We  thus  had  candidate 
thematic  grid  sets  for  each  sense  of  a  given  verb. 

The  file  containing  Chinese  grids  was  created  by  first  matching  the  main  verb 
of  the  English  glosses  to  one  or  more  entries  in  the  English  grids  file.  Separating 

^ As in ‘to speak ill of someone.’ This meaning is the first listed, although it is less
common than others, including ‘exit’, as in exit signs (John Kovarik, p.c.).

^ In fact, part of speech was not included in Chinese dictionaries at all, until the
mid-80s (John Kovarik, Jin Tong p.c.); how and whether to do it is still controversial
(http://linguistlist.org/issues/9/9-1186.html).

^ CETA includes other fields not listed, including HWT and HWS encoding traditional
and simplified characters, STC for the Standard Telegram Code, and REF for the
dictionaries the entry came from.


polysemous  entries  is  an  aid  to  this  process,  since  not  all  grids  are  associated 
with  all  verb  senses.  For  example,  a  wide  range  of  grids  is  available  for  the  Run 
verbs.  The  first  numbers,  again,  are  classes  from  Levin  (Levin,  1993).  Numbers 
less  than  9  are  classes  not  found  in  Levin  that  were  created  automatically  (Dorr, 
1997b). 

(20)  26.3  _ag  chi2  run 

(21)  26.3  _ag_ben_th  chi2  run

(22)  26.3  _ag_th,ben(for)  chi2  run

(23)  47.5.1  _ag,mod-loc()  chi2  run 

(24)  47.5.1  _loc_th  #  chi2  run 

(25)  47.5.1  _th,loc()  chi2  run

(26)  47.7  _th_goal()  chi2  run 

(27)  47.7  _th_src(from)_goal(to)  chi2  run 

(28)  51.3.2  _ag  chi2  run 

(29)  51.3.2  _th,src(),goal()  chi2  run

In  contrast,  a  relatively  small  number  is  available  for  other  meanings  of  this 
character. 

(30)  31.2  _exp_perc,mod-poss(in)  chi2  support

(31)  47.8  _th_loc  #  chi2  support 

In  previous  work,  all  grids  were  associated  with  a  single  entry  and  the  checker 
was  presented  with  a  “bag  of  grids”,  without  a  link  to  a  specific  meaning.  Since
manual  separation  of  senses  was  necessary,  the  likelihood  of  human  error  was 
high;  checkers  would  delete  or  retain  grids  depending  on  which  sense  of  the  verb 
they  had  in  mind.  In  the  case  at  hand,  it  turns  out  that  chi2  means  ‘run’, 
as  in  ‘run  a  business’  or  ‘run  a  machine’,  whereas  the  theta  grids  were  derived 
from  the  motion  verb  run  in  English.  Should  the  grids  prove  inappropriate  in 
the  manual  verification  stage,  they  can  be  deleted  without  affecting  entries  with 
other  meanings. 
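
The difference between a “bag of grids” and sense-specific candidate sets can be pictured as a choice of dictionary key. The sketch below is illustrative only: `ENGLISH_GRIDS` is a hypothetical excerpt of the English grids file containing a few of the grids listed above, and `head_verb` assumes a well-formed gloss of the form ‘to VERB ...’.

```python
# Hypothetical excerpt of the English grids file: head verb -> (class, grid) pairs.
ENGLISH_GRIDS = {
    "run": [("26.3", "_ag"), ("47.7", "_th_goal()"), ("51.3.2", "_ag")],
    "support": [("31.2", "_exp_perc,mod-poss(in)")],
}

def head_verb(gloss):
    """Return the verb heading a gloss such as 'to run' (the word after 'to')."""
    words = gloss.split()
    return words[1] if words[0] == "to" else words[0]

def candidates_by_sense(pinyin, glosses):
    """Key each candidate grid list by (pinyin, gloss) rather than by pinyin alone."""
    return {(pinyin, g): list(ENGLISH_GRIDS.get(head_verb(g), []))
            for g in glosses}

senses = candidates_by_sense("chi2", ["to run", "to support"])

# A checker who rejects the motion-verb grids for the 'run' sense
# leaves the candidate grids for the 'support' sense untouched.
senses[("chi2", "to run")].clear()
print(senses[("chi2", "to support")])
```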

4.2  Automatic  Modification  of  Candidate  Grids 

Each  thematic  grid  in  the  initial  candidate  set  describes  the  argument  structure 
for  the  head  verb  of  the  gloss,  in  some  usage  of  that  (English)  verb.  To  construct 
appropriate  LCSs  for  the  Chinese  verb,  these  grids  must  be  manually  checked  and 
modified  where  necessary.  We  have  further  parsed  the  DEE  field  to  automatically 
make  certain  modifications  that  in  earlier  work  had  been  done  by  hand. 

For instance, the candidate set for the Chinese verb in (16) above, glossed ‘to
force the sovereign to abdicate,’ contains the grid _ag_th,prop(to), because the
English verb force takes an agent, a theme and an optional propositional
complement. After parsing the gloss into subphrases, we can posit that ‘the
sovereign’ is the theme, and ‘to abdicate’ the propositional element. On the
assumption that a gloss of this sort implies that the theme and propositional
element are part of the Chinese verb meaning and not expressed as overt
complements, the grid is reduced to _ag; ‘the sovereign’ and ‘to abdicate’ are
set aside, to be inserted directly into the LCS for the Chinese entry. Thus,
for hand checking, we construct a grid that looks like:

(32)  002  _ag  bi1_gong1  force_the_sovereign_to_abdicate
(th  =  sovereign)  (prop  =  to_abdicate)

Similarly, the following Chinese word receives the grid shown in (9), but with
the possessional modifier mod-poss(with) lexicalized by the verb itself, and thus
removed from the grid:

(33)  9.8  _ag_th  tian2_tu3  fill_in_with_earth  (mod-poss  =  earth)

The underlying intuition is that verbs that incorporate thematic elements
in their meaning would not allow that element to appear in the complement
structure: *fill_in_with_earth with gravel, cf. English *I sodded my lawn with ivy.^

The matching of gloss verb-complements to thematic roles is made as follows.
We first parsed the gloss with simple context-free phrase-structure rules,
to retrieve a flat structure, consisting of the V and a list of complements: NPs,
PPs, clauses like ‘to abdicate’, and predicate adverbs or adjectives, like ‘weary’
in ‘to be weary’. PPs headed with ‘of’ were attached low and not considered as
a VP complement, e.g. give [an explanation of the situation].

Information in parentheses was ignored. Thus, had the gloss above been ‘to
force (i.e. the sovereign) to abdicate’, we would have assumed that the Chinese
verb required a theme argument (like ‘the sovereign’), and the grid would have
been _ag_th instead of just _ag. Parentheses do contain some apparently
important material. For example, there is a gloss ‘to kill (or catch) a tiger’,
which appears to condense two different senses. However, this usage of
parentheses was mostly found in the sections of CETA we suppressed. A series of
nouns was considered a single NP, as in an4 jian3, DEE: to investigate [a law case].

Having split the gloss into its thematic parts, we then match the PPs to
thematic roles that specify the same head as the PP, and match propositional
elements that have matching prepositions or particles (i.e. the ‘to’ in ‘to
abdicate’). Some grids specify roles with no particular preposition, in which
case we heuristically assign roles according to this table:

from:  src  (source)  or  instr
for:  purp  (purpose)
with:  instr  or  mod-poss
without:  mod-poss
into,  to,  against:  goal
under,  around,  along:  mod-loc
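
The table can be read as a fallback lookup that applies only when no role in the grid names the preposition explicitly. The sketch below is one possible reading, not the original implementation; `PREP_ROLES` and `match_pp` are hypothetical names, and a grid is represented as (role, preposition) pairs in which None marks a bare argument and "" marks a role that accepts any preposition.

```python
# Heuristic table from the text: preposition -> roles it may realize.
PREP_ROLES = {
    "from": ("src", "instr"),
    "for": ("purp",),
    "with": ("instr", "mod-poss"),
    "without": ("mod-poss",),
    "into": ("goal",), "to": ("goal",), "against": ("goal",),
    "under": ("mod-loc",), "around": ("mod-loc",), "along": ("mod-loc",),
}

def match_pp(prep, grid_roles):
    """Pick the grid role realized by a PP headed by `prep`, or None."""
    # First choice: a role that names this preposition explicitly.
    for name, required in grid_roles:
        if required == prep:
            return name
    # Fallback: a role with an unspecified preposition, licensed by the table.
    for name, required in grid_roles:
        if required == "" and name in PREP_ROLES.get(prep, ()):
            return name
    return None

fill = [("ag", None), ("th", None), ("mod-poss", "with")]
move = [("th", None), ("goal", "")]
print(match_pp("with", fill))   # mod-poss (explicit preposition)
print(match_pp("into", move))   # goal (via the table)
print(match_pp("for", move))    # None (no purp role in this grid)
```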

^  We  are  ignoring  ‘cognate  objects’,  as  in  I  sodded  my  lawn  with  the  best  sod  available 
(Macfarland,  1995) 


Adverbs become manner components, in positions where they typically modify
the verb (‘to blindly worship foreign things’), rather than an adjective (‘to
be seriously ill’). The adverbial manner components become part of the LCS,
if the entry passes through the hand inspection phase. A gloss that ends with
a dangling preposition is taken as a sign that, where the English verb takes a
PP, the Chinese verb fills the same role with a bare NP argument. Thus the
parentheses are removed from the grid for that role (see Section 2). Bare noun
phrases are matched to non-prepositional-phrase thematic roles. Predicate
adjectives match pred, an identificational predicate (in this case, naming a
property). Any material in the gloss not matching anything in the thematic grid
is kept for incorporation into the LCS as a modifier.
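
The saturation step can be sketched as follows, using the examples in (32) and (33). This is a simplified illustration under stated assumptions rather than the procedure actually used: the function `saturate` is hypothetical, the first role of a grid is assumed to be the subject and is never filled from the gloss, and complements are matched only by exact preposition (None for a bare NP).

```python
def saturate(grid, complements):
    """Remove from `grid` the roles lexicalized by the gloss complements.

    grid: list of (role, prep) pairs; prep is None for a bare argument.
    complements: list of (prep, text) pairs from the parsed gloss.
    Returns (remaining_grid, incorporated), where `incorporated` maps each
    lexicalized role to the gloss material set aside for the LCS.
    """
    # Assumption: the first role is the subject, which the gloss never expresses.
    subject, rest = list(grid[:1]), list(grid[1:])
    incorporated = {}
    for prep, text in complements:
        for role in rest:
            name, required = role
            if required == prep:
                incorporated[name] = text
                rest.remove(role)   # safe: we stop iterating immediately
                break
    return subject + rest, incorporated

# (32) 'to force the sovereign to abdicate' against _ag_th,prop(to)
force = [("ag", None), ("th", None), ("prop", "to")]
print(saturate(force, [(None, "sovereign"), ("to", "to_abdicate")]))

# (33) 'to fill in with earth' against _ag_th,mod-poss(with)
fill = [("ag", None), ("th", None), ("mod-poss", "with")]
print(saturate(fill, [("with", "earth")]))
```

In the first case only the agent remains in the grid; in the second, agent and theme remain and the possessional modifier is set aside.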

In  this  manner,  11360  distinct  theta  role  assignments  were  created.  In  some 
cases  the  original  theta-roles  list  actually  becomes  empty,  in  which  case  it  appears 
as  _0,  the  thematic  grid  for  verbs  with  no  semantic  arguments,  such  as  rain  in 
English  It’s  raining. 

After we saturate the relevant components of the thematic grids, we use the
filled grids to reduce the candidate set of grids. If the set of theta roles
lexicalized by a Chinese verb sense for one candidate grid (which may be the
empty set) is a proper subset of that for another grid of that verb sense, then
the grid with the smaller lexicalized set is discarded, resulting in an 11%
reduction in the number of entries that need to be hand-checked. Thus, if there
were a thematic grid _ag_th generated for ‘to force the sovereign to abdicate’,
it would be discarded in favor of the grid above.

Similarly,  the  seven  candidate  grids  for  ‘to  serve  as  a  guide’  reduce  to  one, 
since  only  one  could  incorporate  the  predicate  with  ‘as’,  that  from  Levin  class 
29.6,  which  includes  verbs  like  act,  behave,  and  pose  as  well  as  serve: 

(34)  Entry  HWS:

PY:  chong1  xiang4  dao3

DEE:  ‘to  serve  as  a  guide’

(35)  Retained:

29.6.b  _th_pred(as)

(36)  Suppressed:

13.1  _ag_goal_th  (e.g.  We  served  them  food)

13.1  _ag_th_goal(to)  (e.g.  We  served  food  to  them)

13.4.1  _ag_th,mod-poss(with)  (e.g.  I  served  him  with  a  warrant)

13.4.1  _ag_th,goal(to)  (e.g.  We  served  a  warrant  to  them)

54.3  _ag_th,loc()  (e.g.  We  serve  Ilf  people  in  this  restaurant)

54.3  _th_poss  (e.g.  This  restaurant  serves  Ilf  people)
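
The pruning rule applied to the ‘serve as a guide’ example above can be sketched as a proper-subset test over the sets of lexicalized roles. The function name `prune` is hypothetical, and the lexicalized sets below are our reading of the example: only the 29.6.b grid can absorb the ‘as’ predicate, so every other candidate lexicalizes nothing.

```python
def prune(candidates):
    """Discard candidates whose lexicalized-role set is a proper subset of another's.

    candidates: list of (grid_label, set_of_lexicalized_roles) for one verb sense.
    """
    kept = []
    for label, roles in candidates:
        # `a < b` on sets is the proper-subset test.
        dominated = any(roles < other for _, other in candidates)
        if not dominated:
            kept.append(label)
    return kept

serve_as_guide = [
    ("29.6.b _th_pred(as)", {"pred"}),
    ("13.1 _ag_goal_th", set()),
    ("13.1 _ag_th_goal(to)", set()),
    ("13.4.1 _ag_th,mod-poss(with)", set()),
    ("13.4.1 _ag_th,goal(to)", set()),
    ("54.3 _ag_th,loc()", set()),
    ("54.3 _th_poss", set()),
]
print(prune(serve_as_guide))
```

Note that candidates with equal lexicalized sets are all kept, since neither is a proper subset of the other; those remain for manual checking.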


5  Results 


Using the process described above, 15565 thematic grids were eliminated,
representing 11% of the total number of candidates. We began the process of
manual evaluation of the theta grids, beginning with the verbs in 10 articles
from Xinhua, comparable to the Wall Street Journal in content; 124 grids were
suppressed for 47 verbs (29 classes), leaving 3041 grids for 263 verbs
(characters, rather than definitions). A set of 51 theta grids was generated
for the 13 verbs in ten sentences from these articles. Chinese speakers deleted
17 grids, or 33.3%. Although these results cover a tiny subset of the full verb
lexicon, this figure compares favorably to the 53% deletion required of the
Spanish data. Importantly, none of the relevant grids had been discarded by our
algorithm.^

The  fact  that  we  parsed  the  complement  structure  in  the  subentries  alerted
the  language  experts  (John  Kovarik  and  Mary-Ellen  Okurowski,  both  from  the 
Department  of  Defense  (DOD),  and  Ron  Dolan  from  the  Library  of  Congress) 
to  additional  senses  that  should  be  eliminated  from  the  lexicon,  e.g.  those  not 
properly  considered  verbal.  Furthermore,  although  we  used  only  the  most  general 
sources,  all  the  dictionaries  included  entries  from  both  classical  and  colloquial 
Chinese.  Only  the  latter  is  used  in  our  domain.  Additionally,  the  classical  entries 
are  often  archaisms  and  figurative  uses  that  would  likely  not  have  the  same 
thematic  structure,  for  example,  the  meaning  ‘to  shelve’  derived  from  a  character 
meaning  ‘to  push  (down)’  (PY  an4).  In  addition,  the  syntactic  structure  assigned 
to  the  gloss  demonstrated  that  some  of  the  entries  glossed  as  verbs  are  more 
appropriately  treated  as  prepositions  or  prepositional  phrases.  Removing  old  and 
syntactically  incorrect  entries  resulted  in  a  further  40%  reduction  for  verb  senses 
in  the  ten  articles.  Since  we  have  been  generating  an  average  of  3.3  thematic 
grids  per  sense,  we  have  decided  to  do  preprocessing  before  generating  the  other 
candidate  grid  sets. 

6  Conclusions  and  Future  Work 

We  have  described  a  procedure  for  automatically  reducing  the  amount  of  manual 
checking  necessary  for  building  the  thematic  grid  structure  for  verbs  in  Chinese. 
We  anticipate  that  this  procedure  will  save  us  time  over  our  original  checking 
procedure.  The  latter,  in  turn,  reduced  the  amount  of  time  required  to  create 
thematic  structure  from  6  person-months  (for  a  lexicon  with  60k  entries  and  3-4k
verbs)  to  approximately  two  weeks  of  hand  verification.  The  time  savings  for
our  project  is  even  more  imperative,  since  we  expect  to  have  almost  double  that 
size  in  verbs  alone,  even  after  removing  inappropriate  entries.  The  procedure 
described  in  this  paper  provides  further  streamlining  for  the  process  of  acquiring 
large-scale  lexica  for  NLP  applications  with  non-optimal  on-line  resources. 

In  addressing  the  polysemy  problem  in  this  context,  we  have,  as  a  by-product, 
produced  a  sense-to-syntax  mapping,  tying  a  verb  sense/character  pair  to  a  set 
of  grids  representing  syntactic  as  well  as  semantic  structure.  This  mapping,  in 
turn,  could  be  used  not  only  for  machine  translation,  but  for  segmentation  and 
word  sense  disambiguation  algorithms  for  Chinese. 

^ The copular grid for the verb shi4 was added to the set, using a grid assigned to
other copular verbs, namely wei2 and zuo4. Somewhat surprisingly, the absence
of the copular grid in our candidate set resulted from an absence in CETA of the
copular meaning for that verb.